Uncertainty Estimation for Heatmap-based Landmark Localization
Abstract
Automatic anatomical landmark localization has made great strides by leveraging deep learning methods in recent years. The ability to quantify the uncertainty of these predictions is a vital ingredient needed to see these methods adopted in clinical use, where it is imperative that erroneous predictions are caught and corrected. We propose Quantile Binning, a data-driven method to categorise predictions by uncertainty with estimated error bounds. This framework can be applied to any continuous uncertainty measure, allowing straightforward identification of the best subset of predictions with accompanying estimated error bounds. We facilitate easy comparison between uncertainty measures by constructing two evaluation metrics derived from Quantile Binning. We demonstrate this framework by comparing and contrasting three uncertainty measures (a baseline, the current gold standard, and a proposed method combining aspects of the two), across two datasets (one easy, one hard) and two heatmap-based landmark localization model paradigms (U-Net and patch-based). We conclude by illustrating how filtering out gross mispredictions caught in our Quantile Bins significantly improves the proportion of predictions under an acceptable error threshold, and offer recommendations on which uncertainty measure to use and how to use it.
Uncertainty estimation, landmark localization, confidence, heatmaps, U-Net
1 Introduction
Automatic landmark localization is an important step in many medical image analysis methods, such as image segmentation [1] and image registration [2, 3]. An erroneous landmark prediction at an early stage of analysis will flow downstream and compromise the validity of final conclusions. Therefore, the ability to quantify the uncertainty of a prediction is a vital requirement in a clinical setting, where explainability is crucial [4] and there is a human in the loop to correct highly uncertain predictions [5].
In this study we propose Quantile Binning, a data-driven framework to estimate a prediction’s quality by learning the relationship between any continuous uncertainty measure and localization error. Using the framework, we place predictions into bins of increasing subject-level uncertainty and assign each bin a pair of estimated localization error bounds. The boundaries of the bins are derived from a trained landmark localization model. These bins can be used to identify the subsets of predictions with expected high or low localization errors, allowing the user to make a choice of which subset of predictions to use based on their expected error bounds. Our method is agnostic to the particular uncertainty metric used, as long as it is continuous and the true function between the uncertainty metric and localization error is monotonically increasing. We showcase our method using three uncertainty measures: a baseline derived from predicted heatmap activations, the gold standard of ensemble model prediction variance [6], as well as introducing our own measure based on ensemble heatmap activations. Furthermore, we introduce two uncertainty evaluation methods, measuring how well an uncertainty measure truly predicts localization error and the accuracy of our predicted error bounds.
We explore the efficacy of our three uncertainty metrics on two paradigms of localization models: an encoder-decoder U-Net that regresses Gaussian heatmaps [7], and a patch-based network that generates a heatmap from patch voting, PHD-Net [8]. We compare how the same uncertainty measures perform under the two different approaches to landmark localization on two datasets of varying difficulty, finding promising results for both paradigms. Our Quantile Binning method is generalisable to any continuous uncertainty measure, and the examples we investigate in this study can be applied as a post-processing step to any heatmap-based landmark localization method. We aspire that this work can be used as a framework to build, evaluate and compare uncertainty metrics in landmark localization beyond those demonstrated in this paper.
2 Related Work
2.1 Landmark Localization
The recent advancement in machine learning has led to convolutional neural networks (CNNs) dominating the task of landmark localization. Encoder-decoder methods, originally proposed for the task of image segmentation [7], have cemented themselves as one of the leading approaches for landmark localization in both the medical domain [9, 10, 11] and computer vision [12, 13, 14]. The architecture of these methods allows images to be analysed at multiple resolutions, learning to predict a Gaussian heatmap centred on the predicted landmark location. The activation of each pixel in the heatmap can be interpreted as the pseudo-probability of that pixel being the target landmark. The network learns to generate a high response near the landmark, smoothly attenuating the responses in a small radius around it. Regressing heatmaps proves more effective than regressing coordinates directly [15], as the heatmap image offers smoother supervision than direct coordinate values, and also models some uncertainty in the prediction.
Fig. 1: Overview of our general Quantile Binning framework. a) We make predictions using a heatmap-based landmark localization model and b) extract a continuous uncertainty measure. c) We learn thresholds to categorise predictions into bins of increasing uncertainty, and estimate error bounds for each bin. d) We filter out predictions from the high-uncertainty bins to improve the proportion of acceptable predictions. e) Finally, we evaluate each uncertainty measure's ability to capture the true error quantiles and the accuracy of its estimated error bounds.
However, in medical imaging the number of available training samples is often small, so the encoder-decoder network is forced to be shallow, compromising its performance [15]. One method to overcome this is a patch-based approach, which alleviates the problem by sampling many small “patches” from an image and learning the relationship between each patch and the target landmark [16, 17]. This approach can generate orders of magnitude more training samples from a single image than the encoder-decoder style methods. Furthermore, patch-based models that use Fully Convolutional Networks (FCNs) have fewer parameters than encoder-decoder architectures, decreasing computational requirements and training times [8].
Noothout et al. [18] implemented a patch-based network using an FCN to jointly perform classification and regression on each patch. The coarse binary classification task determines whether a patch contains the landmark, and the more precise regression task estimates the displacement from the patch to the landmark. This multi-task, joint learning leads to a light-weight network and enhanced localization performance, with the two tasks sharing a feature representation that improves the performance of both [19]. However, the resulting network has a strong local focus and is also susceptible to failure if the predicted patch from the classification task is incorrect. In a follow-up work, Noothout et al. [20] extended their work [18] into a two-stage method: they first train a CNN to provide global estimates for the landmarks, then employ specialised CNNs for each landmark for the final prediction. This method improves upon the first in terms of localization error, but has the drawback of requiring multiple training stages.
To mitigate the inherent local focus of the patch-based methods, we extended the patch-based network [18] by borrowing heatmap regression from the encoder-decoder networks, reformulating the binary classification task as a Gaussian heatmap regression task [8]. In the resulting network, named PHD-Net (Patch Heatmap & Displacement), this smoother supervision improved performance, reducing misidentifications compared to using the classification branch from the prior work [18]. Furthermore, we introduced Candidate Smoothing, a method that combines the features from the two branches to output more accurate predictions along with an uncertainty measure.
2.2 Uncertainty Estimation
Estimating the uncertainty of machine learning predictions is a topic of growing interest, particularly relevant in the domain of medical imaging, where there is often a human in the loop to manually correct flagged predictions. The community has made a concentrated effort in uncertainty estimation for image segmentation, a task similar to landmark localization that instead aims to predict a mask for an entire structure rather than a single point. Some of the most successful approaches use Bayesian approximation methods like Monte-Carlo dropout [21] or an ensemble of networks [22], using the variance of repeated predictions as an indicator of uncertainty. The gold standard approach across many studies is to use an ensemble of networks. This method affords better performance [22] and a more accurate mechanism for Bayesian marginalization [23] compared to a single model using Monte-Carlo dropout. However, the obvious drawback of ensembles is the need to train multiple models.
In landmark localization we are predicting a single point rather than a mask, but similar uncertainty estimation approaches can be utilised. However, there are limited works exploring uncertainty in landmark localization. Payer et al. [24] directly modeled uncertainty during training by learning the Gaussian covariances of target heatmaps, and predicting the distribution of likely locations of the landmark at test time. Lee et al. [25] borrowed from image segmentation approaches by proposing a Bayesian CNN that utilised Monte-Carlo dropout to predict the location and subject-level uncertainty of cephalometric landmarks. Another method to measure the subject-level confidence (the inverse of uncertainty) of a heatmap-based landmark prediction is to measure the maximum heatmap activation (MHA) of the predicted heatmap. Since the activation of a Gaussian heatmap at a particular pixel represents the pseudo-probability of the pixel being the landmark, we can use this pseudo-probability as a confidence measure: the higher the activation, the more certain the prediction. Drevicky et al. [6] compared MHA with ensemble and Monte-Carlo dropout methods, finding MHA surprisingly effective given its simplicity. However, similarly to image segmentation, they found using an ensemble of models was best at predicting uncertainty. They calculated the coordinate prediction variance between an ensemble of models, and found this method performed best at estimating prediction uncertainty.
In our earlier work utilising the patch-based model PHD-Net, MHA was also used as the uncertainty metric [8]. However, the heatmap analysed is distinctly different from the heatmaps predicted by encoder-decoder networks. Rather than explicitly learning a Gaussian function centred around the landmark, the approach combined patch-wise heatmap and displacement predictions. We produced a new non-Gaussian heatmap, where the activation of each pixel is defined by the number of patches that voted for it, regularised by the coarse global likelihood prediction. Therefore, the resulting heatmap represents patch-wise ensemble votes rather than a Gaussian function, where the MHA is the pixel with the most “patch votes”.
To the best of our knowledge, no study has investigated how heatmap-based uncertainty estimation measures can be used to filter out poor predictions in landmark localization. Furthermore, no general framework has been proposed to compare how well uncertainty measures can predict localization error, an important practical application in clinical settings.
3 Contributions
In this paper, we propose a general framework to compare and evaluate uncertainty measures for landmark localization. This work extends the analysis of MHA in [8], with more in depth experiments and comparisons. Our contributions, depicted in Fig. 1, are threefold:
• We propose an ensemble-based method to extract landmark coordinates from an ensemble of heatmaps and estimate prediction uncertainty: Ensemble Maximum Heatmap Activation (E-MHA) (Fig. 1a, 1b, Sec. 4.2).
• We propose Quantile Binning, a method to categorise predictions by any continuous uncertainty measure, and estimate error bounds for each bin (Fig. 1c, Sec. 4.3).
• We construct two evaluation metrics for uncertainty estimation methods from Quantile Binning: 1) similarity between predicted bins and true error quantiles; 2) accuracy of estimated error bounds (Fig. 1e, Sec. 4.4).
We demonstrate the impact of our contributions by using our proposed Quantile Binning to compare E-MHA to two existing coordinate extraction and uncertainty estimation methods: a baseline of Single Maximum Heatmap Activation (S-MHA), and a gold standard of Ensemble Coordinate Prediction Variance (E-CPV). In Sec. 6.2, we compare the baseline coordinate extraction performance of the three approaches, followed by the uncertainty estimation performance in Sec. 6.3. We explore the reach of heatmap-based uncertainty measures by demonstrating they are applicable to both U-Net regressed Gaussian heatmaps and patch-based voting heatmaps. We show each uncertainty measure can identify a subset of predictions with significantly lower mean error than the full set by filtering out predictions from high uncertainty bins (Fig. 1c). Finally, in Sec. 7.2 we make recommendations for which uncertainty measure to use, and how to use it.
We provide an open source implementation of this work along with the data to reproduce our results at https://github.com/pykale/pykale/tree/main/examples/landmark_uncertainty.
4 Methods
4.1 Landmark Localization Models
First, we briefly review the two models we use for landmark localization, allowing us to compare the generalisability of our uncertainty measures across different heatmap generation approaches. We implement a variation of the popular encoder-decoder networks that regresses Gaussian heatmaps, U-Net [7]. We also implement a patch-based method, PHD-Net [8], which produces a heatmap from patch votes.
4.1.1 Encoder-Decoder Model (U-Net)
The vast majority of state-of-the-art landmark localization approaches are based on the foundation of a U-Net style encoder-decoder architecture. The architecture of U-Net follows a “U” shape, first extracting features at several downsampled resolutions, before rebuilding to the original dimensionality in a symmetrical upsampling path. Skip connections are employed between each level, preserving spatial information. The rationale behind the architecture design is to inject some inductive bias into the model architecture itself, helping it learn the local characteristics of each landmark, while preserving the global context.
Rather than regressing coordinates directly, the objective of the model is to learn a Gaussian heatmap image for each landmark, with the centre of the heatmap on the target landmark. For each landmark $i$ with 2D coordinate position $\mathbf{c}_i$, the 2D heatmap image $H_i$ is defined as the 2D Gaussian function:
$H_i(\mathbf{x}) = \exp\left(-\dfrac{\lVert \mathbf{x} - \mathbf{c}_i \rVert_2^2}{2\sigma^2}\right)$  (1)
The network learns weights $\mathbf{w}$ and biases $\mathbf{b}$ to predict the heatmap $\hat{H}_i$. During inference, we can interpret the activation of each pixel in the predicted heatmap as the pseudo-probability of that pixel being the landmark. We will exploit this in our uncertainty estimation methods.
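As an illustrative sketch, a Gaussian heatmap target of this form can be generated as follows (the image size and `sigma` value here are arbitrary assumptions for the example, not the training settings used in our experiments):

```python
import numpy as np

def gaussian_heatmap(height, width, landmark, sigma=2.0):
    """Generate a 2D Gaussian heatmap centred on `landmark` = (x, y)."""
    xs = np.arange(width)
    ys = np.arange(height)
    xx, yy = np.meshgrid(xs, ys)  # pixel coordinate grids
    sq_dist = (xx - landmark[0]) ** 2 + (yy - landmark[1]) ** 2
    # Peak activation of 1 at the landmark, smoothly attenuating around it
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

heatmap = gaussian_heatmap(64, 64, landmark=(30, 20))
```

The pixel at the landmark receives the maximum activation of 1, and activations decay smoothly with distance, giving the smoother supervision signal discussed above.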
4.1.2 Patch-based Model (PHD-Net)
Patch-based models use a Fully Convolutional Network (FCN), whose architecture resembles the encoder half of an encoder-decoder network. They are therefore more lightweight than encoder-decoder networks, with significantly fewer parameters, leading to faster training.
In our earlier work, we proposed PHD-Net: a multi-task patch-based network [8], building on the work by Noothout et al. [18]. We incorporate a variant of the heatmap objective function from encoder-decoder networks into the objective function, predicting the 2D displacement from each patch to the landmark alongside the coarse Gaussian pseudo-probability of each patch.
PHD-Net aggregates the patch-wise predictions to obtain a heatmap by plotting candidate predictions from the displacement branch as small Gaussian blobs, then regularising the map by the upsampled Gaussian from the heatmap branch.
Again, we can consider the activation of each pixel in the heatmap as an indicator of uncertainty, where instead of the pseudo-probability, the activation represents the number of “patch votes”.
4.1.3 Ensemble Models
Using an ensemble of models is more robust than using a single model, as it reduces the effect of a single model becoming stuck in a local minimum during training. Furthermore, we can use the variance in the predictions of each model to estimate the uncertainty of the prediction [6]. We use an ensemble of $M$ models, where each model is identical except for the random initialisation of its parameters.
4.2 Estimating Uncertainty and Coordinate Extraction
Although generated differently, we hypothesise both U-Net and PHD-Net produce heatmaps containing useful information to quantify a prediction’s uncertainty - but are they equally effective? To this end, we compare the performance of both models under three uncertainty estimation methods: a baseline approach, our proposed approach extending the baseline to an ensemble of networks, and the gold standard approach. Each method extracts coordinate values from the predicted heatmap, and estimates the prediction’s uncertainty.
4.2.1 Single Maximum Heatmap Activation (S-MHA)
We introduce the baseline coordinate extraction and uncertainty measure. We use the standard method to obtain the predicted landmark's coordinates $\hat{\mathbf{c}}$ from the predicted heatmap $\hat{H}$, by finding the pixel with the highest activation:
$\hat{\mathbf{c}} = \arg\max_{\mathbf{x}} \hat{H}(\mathbf{x})$  (2)
We hypothesise that the pixel activation at the coordinates $\hat{\mathbf{c}}$ can describe the model's uncertainty: the higher the activation, the lower the uncertainty, and the lower the prediction error. However, due to this inverse relationship, this measures “confidence”, not uncertainty.
We transform our confidence metric to an “uncertainty” metric $u$, by applying the following transformation to the pixel activation at the predicted landmark location:
$u = \dfrac{1}{\hat{H}(\hat{\mathbf{c}}) + \epsilon}$  (3)
where $\epsilon$ is a small constant scalar that prevents division by zero. Now, as the pixel activation at $\hat{\mathbf{c}}$ increases, $u$ decreases.
We call the transformed activation of this peak pixel Single Maximum Heatmap Activation (S-MHA). This is a continuous value bounded between $[\frac{1}{1+\epsilon}, \frac{1}{\epsilon})$ for U-Net, and between $[\frac{1}{P+\epsilon}, \frac{1}{\epsilon})$ for PHD-Net, where $P$ is the number of patches. The lower the S-MHA, the lower the uncertainty.
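As a minimal illustrative sketch (not the released implementation), S-MHA coordinate extraction and the inversion transform can be written as follows, with `eps` playing the role of the small constant above:

```python
import numpy as np

def s_mha(pred_heatmap, eps=1e-6):
    """Return (coords, uncertainty) from a single predicted heatmap."""
    # Pixel with the highest activation -> predicted landmark (row, col)
    coords = np.unravel_index(np.argmax(pred_heatmap), pred_heatmap.shape)
    activation = pred_heatmap[coords]
    # Invert the confidence so that higher values mean higher uncertainty
    uncertainty = 1.0 / (activation + eps)
    return coords, uncertainty
```

A sharp, confident peak yields a low S-MHA value, while a flat, diffuse heatmap yields a high one.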
4.2.2 Ensemble Maximum Heatmap Activation (E-MHA)
In this work we extend the S-MHA uncertainty measure to ensemble models. We hypothesise that E-MHA should hold a stronger correlation with error than S-MHA, due to the additional robustness an ensemble of models affords. We generate the mean heatmap of the $M$ models in the ensemble, and obtain the predicted landmark coordinates $\hat{\mathbf{c}}$ as the pixel with the highest activation:
$\hat{\mathbf{c}} = \arg\max_{\mathbf{x}} \dfrac{1}{M}\sum_{m=1}^{M} \hat{H}_m(\mathbf{x})$  (4)
Again, we hypothesise that the activation of this pixel correlates with model confidence. Similar to S-MHA, we invert the pixel activation, adding a small $\epsilon$ to the activation at $\hat{\mathbf{c}}$, to give us our uncertainty measure $u$:
$u = \dfrac{1}{\frac{1}{M}\sum_{m=1}^{M}\hat{H}_m(\hat{\mathbf{c}}) + \epsilon}$  (5)
E-MHA is a continuous value constrained to the same bounds as S-MHA. This is a form of late feature fusion, combining features from all models before a decision is made.
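The late-fusion step can be sketched directly from Eqs. (4) and (5) (a simplified illustration; the ensemble heatmaps are assumed to be stacked into one array):

```python
import numpy as np

def e_mha(heatmaps, eps=1e-6):
    """heatmaps: array of shape (M, H, W), one predicted heatmap per member."""
    mean_map = heatmaps.mean(axis=0)  # late fusion: average before deciding
    coords = np.unravel_index(np.argmax(mean_map), mean_map.shape)
    uncertainty = 1.0 / (mean_map[coords] + eps)
    return coords, uncertainty
```

Averaging the heatmaps before extracting the peak means that a member disagreeing with the rest lowers the fused activation and so raises the reported uncertainty.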
4.2.3 Ensemble Coordinate Prediction Variance (E-CPV)
We also implement the gold standard of uncertainty estimation: ensemble coordinate prediction variance [6]. The more the models disagree on where the landmark is, the higher the uncertainty.
To extract a landmark's coordinates, we first use the traditional S-MHA coordinate extraction method on each of the $M$ models' predicted heatmaps. Then, we use decision-level fusion, calculating the mean of the individual coordinate predictions $\hat{\mathbf{c}}_m$ to compute the final coordinate prediction $\hat{\mathbf{c}}$:
$\hat{\mathbf{c}} = \dfrac{1}{M}\sum_{m=1}^{M} \hat{\mathbf{c}}_m$  (6)
We generate the Ensemble Coordinate Prediction Variance (E-CPV) by calculating the mean absolute difference between the $M$ model predictions $\hat{\mathbf{c}}_m$ and $\hat{\mathbf{c}}$:
$u = \dfrac{1}{M}\sum_{m=1}^{M} \lVert \hat{\mathbf{c}}_m - \hat{\mathbf{c}} \rVert$  (7)
This is a continuous value bounded between $[0, \sqrt{H^2 + W^2}]$, where $H$ and $W$ are the height and width of the original image, respectively. The more the models disagree on the landmark location, the higher the coordinate prediction variance, and the higher the uncertainty.
Unlike S-MHA and E-MHA, this metric completely ignores the value of the heatmap activations, potentially losing some useful uncertainty information by opting for fusion at the decision-level.
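A minimal sketch of E-CPV, combining Eqs. (6) and (7) (the Euclidean distance is used here as one reasonable reading of the per-model deviation; a per-axis absolute difference is an equally plausible implementation choice):

```python
import numpy as np

def e_cpv(heatmaps):
    """Decision-level fusion: average per-model coordinates, use their spread
    as the uncertainty. heatmaps: array of shape (M, H, W)."""
    coords = np.array(
        [np.unravel_index(np.argmax(h), h.shape) for h in heatmaps],
        dtype=float,
    )  # shape (M, 2)
    mean_coord = coords.mean(axis=0)  # Eq. (6): final prediction
    # Eq. (7): mean distance of each model's prediction from the mean
    uncertainty = np.linalg.norm(coords - mean_coord, axis=1).mean()
    return mean_coord, uncertainty
```

Note that, as discussed above, only the argmax locations enter this measure; the heatmap activation values themselves are discarded.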
4.3 Quantile Binning: Categorising Predictions by Uncertainty and Estimating Error Bounds
We leverage the described uncertainty measures to inform the subject-level uncertainty of any given prediction, i.e. is the model's prediction likely to be accurate or inaccurate, based on this uncertainty value? We propose a data-driven approach, Quantile Binning, using a hold-out validation set to establish thresholds delineating varying levels of uncertainty specific to each trained model. We use these learned thresholds to categorise our predictions into bins and estimate error bounds for each bin. We opt for a data-driven approach over a rule-based approach for two reasons: 1) When models are constrained to a limited dataset (on the order of hundreds of training samples), they can have difficulty converging well, particularly in the task of localizing a difficult landmark. 2) If the limited dataset contains high variance, different training sets can lead to confoundingly different learned weights under the same model architecture and training schedule. Therefore, establishing a set of thresholds for each model is more invariant to training differences than using the same thresholds for all models.
Quantile Binning is application agnostic; it is applicable to any data as long as it consists of continuous (uncertainty, error) tuples.
In this paper, we generate these pairings after the landmark localization model is trained. We use a hold-out validation set and make coordinate predictions and uncertainty estimates using each of our three uncertainty measures described in Section 4.2. Since we have the ground truth annotations of the validation set, we can produce continuous (uncertainty, localization error) tuples for each uncertainty measure.
4.3.1 Establishing Quantile Thresholds
We aim to categorise predictions into $Q$ bins using our continuous uncertainty metrics. We make the following assumption: the true function between a good uncertainty measure and localization error is monotonically increasing (i.e. the higher the uncertainty, the higher the error).
Quantile binning is a non-parametric method that fits well with these assumptions: a variant of histogram binning, which is commonly used for calibration of predictive models [26, 27]. By considering the data in quantiles rather than intervals, we can better capture a skewed distribution, as the outliers in the tail of the distribution can be grouped together. In other words, quantiles divide the probability distribution into areas of approximately equal probability.
This property allows us to interrogate model-specific (epistemic) uncertainties. Rather than compute uncertainty thresholds based on predefined error thresholds for each bin, we use Quantile Binning to create thresholds that group our samples in relative terms. This enables the user to flag the worst-performing subset of predictions. We describe the steps below.
First, for any given uncertainty measure we sort our validation set tuples in ascending order of their uncertainty value and sequentially group them into $Q$ equal-sized bins $B_1, \dots, B_Q$. We assign each bin a pair of boundaries defined by the uncertainty values at the edges of the bin, creating an interval $[\tau_{q-1}, \tau_q)$. To capture all predictions at the tail ends of the distribution, we set $\tau_0 = -\infty$ and $\tau_Q = \infty$.
During inference, we use these boundaries to bin our predictions into $Q$ bins ($B_1, \dots, B_Q$), with uncertainty increasing with each bin. Each predicted landmark with uncertainty $u$, where $\tau_{q-1} \le u < \tau_q$, is binned into $B_q$. The distribution of samples should be uniform across the bins, due to the quantile method we used to obtain the thresholds.
The higher the value of $Q$, the more finely we can categorise our uncertainty estimates. However, as $Q$ increases, the method becomes more sensitive to any miscalibration of the uncertainty measure, leading to less accurate prediction binnings.
This method is agnostic of the scale used in the uncertainty measure, and therefore is applicable to all our defined uncertainty measures.
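The thresholding steps above can be sketched with NumPy quantiles (a simplified illustration on synthetic uncertainty values; $Q = 5$ bins is an assumption for the example, not a prescribed setting):

```python
import numpy as np

def fit_quantile_thresholds(val_uncertainties, n_bins=5):
    """Learn bin boundaries from validation-set uncertainty values."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]  # interior quantile levels
    inner = np.quantile(val_uncertainties, qs)
    # Open the outer boundaries so every test-time value falls into some bin
    return np.concatenate(([-np.inf], inner, [np.inf]))

def assign_bin(uncertainty, thresholds):
    """Return the bin index (0 = lowest uncertainty) for a new prediction."""
    return int(np.searchsorted(thresholds, uncertainty, side="right")) - 1

rng = np.random.default_rng(0)
val_u = rng.exponential(size=200)  # synthetic, skewed validation uncertainties
thresholds = fit_quantile_thresholds(val_u, n_bins=5)
```

Because the boundaries are quantiles of the validation distribution, the validation samples spread (near-)uniformly across the bins even though the underlying uncertainty distribution is heavily skewed.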
4.3.2 Estimating Error Bounds using Isotonic Regression
Establishing thresholds has allowed us to filter predictions by uncertainty in relative terms, but we lack a method to estimate the absolute localization error for each bin. For example, for an easy landmark, even the samples in the highest-uncertainty bin $B_Q$ may have a very low localization error in absolute terms, whereas for a more difficult landmark even the lowest relative uncertainty samples in $B_1$ may have a high error. Therefore, in order to offer users information about the expected error for each group, we present a data-driven approach to predict error bounds.
First, we estimate the true function between the uncertainty measure and localization error using our hold-out validation set. However, since the validation tuples represent a small random sample of the true distribution, plotting this relationship may be noisy. Therefore, to make the best approximation of our assumed true function, we use isotonic regression to fit a monotonically increasing line between uncertainty and localization error using our validation tuples. Isotonic regression is a method to fit a free-form, non-decreasing line to a set of observations, and is also commonly used for predictive model calibration [28, 26]. It is non-parametric, so it can learn the true distribution given enough i.i.d. data. The regression seeks the weighted least squares fit ŷ subject to the constraint that ŷ_i ≤ ŷ_j whenever x_i ≤ x_j:

    min_{ŷ} Σ_{i=1}^{n} w_i (ŷ_i − y_i)²,  subject to ŷ_i ≤ ŷ_j for x_i ≤ x_j    (8)

where w_i > 0 and n is the number of (x_i, y_i) pairs. In our case, the observations (x_i, y_i) are the (uncertainty, error) tuples.
Next, we use our isotonically regressed line ĝ to estimate error bounds for each of our quantile bins. Since we are interested in error bounds that only increase with uncertainty, we discard the potentially noisy observed tuples from our validation set and instead predict error bounds from the uncertainty values using the fitted function. For each bin's threshold interval [u_q^min, u_q^max), we estimate the expected error interval [ĝ(u_q^min), ĝ(u_q^max)). We use these values as the estimated lower and upper error bounds of the predictions in bin B_q. Note that for the lowest-uncertainty bin we can only estimate an upper bound, and for the highest-uncertainty bin we can only estimate a lower bound.
In summary, we use a data-driven approach to learn thresholds that progressively filter predictions at inference into Q bins of increasing uncertainty, and assign each bin estimated error bounds.
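The fit in Eq. (8) and the per-bin bound lookup can be sketched as follows. This is a plain pool-adjacent-violators implementation with unit weights; in practice a library routine such as scikit-learn's IsotonicRegression serves the same purpose, and the function names here are our own:

```python
import numpy as np

def isotonic_fit(y):
    """Pool Adjacent Violators: least-squares fit of a non-decreasing
    sequence to y (unit weights), i.e. the minimiser in Eq. (8)."""
    means, weights, counts = [], [], []
    for yi in y:
        means.append(float(yi)); weights.append(1.0); counts.append(1)
        # pool adjacent blocks while the monotonicity constraint is violated
        while len(means) > 1 and means[-2] > means[-1]:
            w = weights[-2] + weights[-1]
            m = (means[-2] * weights[-2] + means[-1] * weights[-1]) / w
            counts[-2] += counts[-1]
            means[-2], weights[-2] = m, w
            means.pop(); weights.pop(); counts.pop()
    return np.repeat(means, counts)

def bin_error_bounds(val_uncertainty, val_error, boundaries):
    """Estimate (lower, upper) error bounds for each bin by evaluating the
    fitted monotone function at the bin's uncertainty thresholds."""
    order = np.argsort(val_uncertainty)
    x = np.asarray(val_uncertainty, dtype=float)[order]
    g = isotonic_fit(np.asarray(val_error, dtype=float)[order])
    # np.interp clips at the ends, which handles the -inf/+inf outer boundaries
    return [(np.interp(lo, x, g), np.interp(hi, x, g))
            for lo, hi in zip(boundaries[:-1], boundaries[1:])]
```

Evaluating the fitted function at the thresholds, rather than using the raw validation errors, is what guarantees the bounds only increase with uncertainty.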
4.4 Evaluation Metrics for Uncertainty Measures
In this subsection we construct methods to evaluate how well an uncertainty measure’s predicted bins represent the true error quantiles, and how accurate each bin’s estimated error bounds are.
4.4.1 Evaluating the Predicted Bins
A good uncertainty measure will have a strong correlation with localization error. Therefore, it should provide quantile thresholds that correspond to the true error quantiles. For example, since bin B_Q contains the predictions with uncertainties in the lowest quantile, the localization errors of the predictions in B_Q should be the lowest quantile of errors in the test set. This can be generalised to each bin, up to B_1, which should contain the errors in the highest quantile.
To evaluate this desired property, we propose to measure the similarity between each predicted bin and its respective theoretically perfect bin.
We create the ground truth (GT) bins by ordering the test set samples by error, matching the ordering of the predicted bins. Then, we sequentially bin them into Q equally sized bins: B_1^GT, …, B_Q^GT.
For each predicted and GT bin pair B_q and B_q^GT, we calculate the Jaccard Index (JI) between them and report the mean measure of each bin across all folds:

    JI(B_q, B_q^GT) = |B_q ∩ B_q^GT| / |B_q ∪ B_q^GT|    (9)
The higher the JI, the better the uncertainty measure has binned predictions by localization error, and hence the better it predicts that error.
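The comparison in Eq. (9) for a single fold can be sketched as follows (function names are our own choosing):

```python
import numpy as np

def jaccard_per_bin(uncertainties, errors, n_bins=5):
    """Jaccard Index between predicted uncertainty-quantile bins and the
    ground-truth error-quantile bins over the same test samples."""
    pred_bins = np.array_split(np.argsort(uncertainties), n_bins)
    gt_bins = np.array_split(np.argsort(errors), n_bins)
    jis = []
    for bp, bg in zip(pred_bins, gt_bins):
        sp, sg = set(bp.tolist()), set(bg.tolist())
        jis.append(len(sp & sg) / len(sp | sg))
    return jis
```

A perfectly calibrated measure orders samples exactly as their errors do, giving a JI of 1 in every bin.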
4.4.2 Accuracy of Estimated Error Bounds
A good uncertainty measure will have a monotonically increasing relationship with localization error. Therefore, estimating the true function using isotonic regression should provide accurate error bound estimations.
To measure this, for each bin B_q we calculate the percentage of predictions whose error falls within the estimated error bound interval [ĝ(u_q^min), ĝ(u_q^max)). The higher the percentage, the more accurate our estimated error bounds.
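This metric reduces to a simple coverage computation per bin; a minimal sketch (naming is our own):

```python
import numpy as np

def bound_accuracy(bin_errors, lower, upper):
    """Percentage of a bin's localization errors falling inside the
    estimated error bound interval [lower, upper)."""
    e = np.asarray(bin_errors, dtype=float)
    return 100.0 * np.mean((e >= lower) & (e < upper))
```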
Fig. 2: (a) Landmarks for the four chamber (4CH) CMR: magenta = tricuspid valve; yellow = mitral valve; red = left ventricle apex. (b) Landmarks for the short axis (SA) CMR: magenta = superior right ventricle insertion point; yellow = inferior right ventricle insertion point; red = inferior lateral reflection of the right ventricle free wall.
5 Datasets
We perform our experiments using a dataset from the ASPIRE Registry [29], comprising Cardiac Magnetic Resonance Imaging (CMR) sequences from a mix of subjects with pulmonary arterial hypertension (PAH) and subjects with no pulmonary hypertension (PH). Each subject has a four chamber (4CH) view and/or a short axis (SA) view. Each pixel of a CMR sequence represents 0.9375mm of the organ, and each sequence has 20 frames (we use only the first frame for landmark localization in this study). There are 303 SA images, each with three annotated landmarks: the inferior right ventricle insertion point (infSA), the superior right ventricle insertion point (supSA), and the inferior lateral reflection of the right ventricle free wall (RVSA). There are 422 4CH images, each with three annotated landmarks: the apex of the left ventricle at end diastole (LVDEV Apex), the mitral valve (mitral), and the tricuspid valve (tricuspid). The 4CH dataset represents a more challenging landmark localization task, as its images have much higher variability than those of the SA dataset. The landmarks were decided and manually labelled by a radiologist, as shown in Fig. 2. For this study, we consider the SA images the EASY dataset and the 4CH images the HARD dataset.
6 Experiments and Results
First, in Sec. 6.2 we present the baseline landmark localization performance of PHD-Net and U-Net over both SA and 4CH datasets, using the S-MHA, E-CPV, and E-MHA methods to extract coordinates. This gives us a comparison of the coordinate extraction performance of each of our methods, and a baseline against which to measure the effectiveness of each method's uncertainty estimation. Second, in Sec. 6.3 we interrogate how using Quantile Binning with our uncertainty measures delineates predictions in terms of their localization error, and compare the predicted bins to the ground truth error quantiles. We show a practical example of how filtering out highly uncertain predictions can dramatically increase the proportion of acceptable localization predictions. Finally, in Sec. 6.4 we assess how well the uncertainty measures can predict error bounds for each bin. When comparing between bins we use an unpaired t-test to test for significance. When comparing uncertainty metrics within the same bin category and model, we use a paired t-test.
6.1 Experimental Setup
Table 1: Localization error (mm) for the outlined uncertainty methods. All denotes the entire prediction set; B5 denotes the subset with the lowest uncertainty. Mean error ± standard deviation is reported over all folds and all landmarks. Bold indicates the best result in a row for a given dataset.

| Method | U-Net (4CH) | PHD-Net (4CH) | U-Net (SA) | PHD-Net (SA) |
|---|---|---|---|---|
| S-MHA All | **10.00 ± 18.99** | 11.07 ± 21.33 | 5.86 ± 14.19 | **3.58 ± 3.52** |
| S-MHA B5 | 6.79 ± 6.09 | **5.80 ± 9.03** | 3.62 ± 2.45 | **2.78 ± 1.99** |
| E-MHA All | **6.36 ± 8.01** | 9.14 ± 18.11 | 4.37 ± 8.86 | **3.36 ± 3.50** |
| E-MHA B5 | 4.93 ± 2.85 | **4.70 ± 3.21** | 2.98 ± 2.09 | **2.39 ± 1.90** |
| E-CPV All | **8.13 ± 10.16** | 9.42 ± 13.07 | 4.97 ± 7.51 | **3.22 ± 2.93** |
| E-CPV B5 | 5.34 ± 3.00 | **5.10 ± 6.76** | 3.75 ± 2.13 | **2.47 ± 2.08** |
We split both datasets into 8 folds and perform 8-fold cross validation for both U-Net and PHD-Net. For each iteration, we select one fold as our testing set, one as our hold-out validation set, and the remaining 6 as our training set. For the ensemble methods we train 5 separate models at each fold, a compromise with computational constraints that we assert is representative enough to compare the uncertainty methods for our purposes. Each model is identical apart from randomly initialised weights. We randomly select one model for our S-MHA uncertainty measure. For our Quantile Binning method, we select Q = 5 bins, balancing the constraints of our small datasets with a useful number of uncertainty categories.
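The splitting scheme can be sketched as follows (taking the next fold as the hold-out validation set is our assumption; the text above only states that one fold is used for validation):

```python
import numpy as np

def cross_validation_splits(n_samples, n_folds=8, seed=0):
    """Yield (train, val, test) index arrays: one fold for testing, one
    hold-out fold for validation (used to fit Quantile Binning), and the
    remaining six folds for training."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        j = (i + 1) % n_folds  # assumed: validation is the next fold
        test, val = folds[i], folds[j]
        train = np.concatenate(
            [f for k, f in enumerate(folds) if k not in (i, j)])
        yield train, val, test
```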
We implement our U-Net model [7] using the MONAI package [30]. We design the model with 5 encoding-decoding levels, creating 1.63M learnable parameters. We modify the objective from image segmentation to simultaneous landmark localization, minimising the mean squared error between the target and predicted heatmaps. We use the full-size image as input and learn heatmaps of the same size. We generate target heatmaps using Eq. (1) with a standard deviation of 8, and train for 1000 epochs with a batch size of 2 and a learning rate of 0.001 using the Adam optimiser (settings from [8]).
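The target heatmap generation referred to above (Eq. (1), a 2D Gaussian centred on the landmark) can be sketched as follows; the image size and function name are illustrative:

```python
import numpy as np

def gaussian_target_heatmap(height, width, landmark_xy, sigma=8.0):
    """2D Gaussian target heatmap peaking at 1.0 on the landmark,
    using the standard deviation of 8 described for U-Net training."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - landmark_xy[0]) ** 2 + (ys - landmark_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))
```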
We implement our PHD-Net model following [8], creating a model with 0.06M learnable parameters. For all experiments we train PHD-Net for 1000 epochs using a batch size of 32 and a learning rate of 0.001, using the Adam optimiser. We train one landmark at a time. Note that the only differences in setup from [8] are different fold splits and training for an additional 500 epochs (matching U-Net) with no early stopping, since we now use our validation set for Quantile Binning.
6.2 Baseline Landmark Localization Performance
Figure 3 and the All rows in Table 1 show the baseline performance for U-Net and PHD-Net at localizing landmarks in our 4CH and SA datasets. We make the following observations:
- When considering the entire set of landmarks (All), performance is better on the SA dataset for both models, with PHD-Net outperforming U-Net. On the 4CH dataset, U-Net outperforms PHD-Net, suggesting the higher-capacity U-Net is more robust to datasets with large variations.
- Simply using a single model with our S-MHA strategy is predictably less robust than the ensemble approaches.
- E-MHA outperforms the previous gold standard, E-CPV, for coordinate extraction. However, does it outperform E-CPV in terms of uncertainty estimation? We explore this in Sec. 6.3.
- The standard deviation of the error for the baseline All results in Table 1 is high for all models. We aspire to catch these bad predictions using Quantile Binning in Sec. 6.3.
Fig. 3: Cumulative distribution of localization errors over the entire prediction set, showing the percentage of predictions under a given error threshold. The vertical line is the acceptable error threshold, chosen by a radiologist. Results are shown over all folds and landmarks for coordinate extraction using each uncertainty method. Higher percentages are better.
(a) Localization error per bin - 4CH dataset (lower is better).
(b) Localization error per bin - SA dataset (lower is better).
(c) Jaccard Index per bin - 4CH dataset (higher is better).
(d) Jaccard Index per bin - SA dataset (higher is better).
(e) Estimated error bound accuracy - 4CH dataset (higher is better).
(f) Estimated error bound accuracy - SA dataset (higher is better).
Fig. 4: Quantile Binning results for U-Net and PHD-Net over all landmarks and folds, using our three coordinate extraction and uncertainty estimation methods. Bins are ordered in descending order of uncertainty (B1 highest uncertainty, B5 lowest uncertainty). (a) and (b) show the mean localization error of each bin, with error decreasing as we move to bins of lower uncertainty. (c) and (d) present the Jaccard Index, showing how similar the predicted bins are to the ground truth error quantiles. (e) and (f) visualise the estimated error bound accuracy, showing the percentage of predictions within each bin's estimated error bounds.
6.3 Analysis of the Predicted Quantile Bins
We apply quantile binning to each uncertainty measure: S-MHA, E-MHA and E-CPV. We compare results over U-Net and PHD-Net for both the SA and 4CH datasets.
We found the most useful information is at the tail ends of the uncertainty distributions. Figs. 4(c) & 4(d) plot the Jaccard Index between ground truth error quantiles and predicted error quantiles. The bin with the highest uncertainty quantile (B1) is significantly better at capturing the correct subset of predictions than the intermediate bins (B2-B4). Similarly, in most cases the bin representing the lowest uncertainties (B5) had a significantly higher Jaccard Index than the intermediate bins, but still lower than B1. Figs. 4(a) & 4(b) show the mean error of the samples of each quantile bin over both datasets. The most significant reduction in localization error is from B1 to B2 for all uncertainty measures, further indicating our uncertainty measures are well calibrated to filter out gross mispredictions. These findings suggest that most of the utility of the investigated uncertainty measures is found at the tail ends of the scale. This is an intuitive finding, as the predictions in B1 are certainly uncertain, and the predictions in B5 are certainly certain.
The worse trained the landmark localization model, the more useful the uncertainty measure. Table 1 shows the localization error of all methods, models and datasets for the entire set (All) and the lowest uncertainty subset (B5) of predictions. PHD-Net's baseline localization performance on the 4CH dataset was worse than U-Net's. However, when we consider the lowest uncertainty subset of predictions (B5), PHD-Net sees a 47% average reduction in error from all predictions (All), compared to U-Net's average reduction of 30%. Similarly, U-Net performed worse than PHD-Net on the SA dataset, but saw an average error reduction of 31% compared to PHD-Net's 25%. This suggests that all investigated uncertainty measures are more effective at identifying gross mispredictions when models are poorly trained.
Using heatmap-based uncertainty measures generalises across heatmap generation approaches. The bin similarities in Figs. 4(c) & 4(d) show that S-MHA and E-MHA yield similar performance with PHD-Net and U-Net, despite their different heatmap derivations. Surprisingly, using E-MHA does not give a significant increase in bin similarity compared to S-MHA, suggesting the thresholds remain relatively stable across models.
No investigated method is conclusively best for estimating uncertainty in all scenarios. For the more challenging 4CH data, Fig. 4(c) shows E-CPV is significantly better than S-MHA and E-MHA for both models at capturing the true error quantiles, corroborating the findings of [6]. E-CPV is particularly good at identifying the worst predictions (B1). For the easier SA data, no method has a significantly higher Jaccard Index.
Therefore, when we generalise across both models and datasets, all uncertainty measures fared broadly similarly on average in terms of error reduction between the entire set and the B5 subset of predictions: S-MHA had an average error reduction of 35.07%, E-MHA 32.94%, and E-CPV 32%.
Despite similar performances in uncertainty estimation, we found E-MHA yields the greatest localization performance overall. Table 1 shows E-MHA offers the best localization performance for B5 across both datasets and models. This is due to the combination of offering the most robust coordinate extraction on average (Fig. 3) and similar uncertainty estimation performance (Figs. 4(c) & 4(d)). We more concretely demonstrate Quantile Binning's ability to identify low uncertainty predictions in Fig. 5. We clearly observe a significant increase in the percentage of images below the acceptable error threshold of 5mm when considering only predictions in B5, with E-MHA giving the greatest proportion of acceptable predictions. For both datasets we observe that PHD-Net using E-MHA has a higher proportion of acceptable predictions than U-Net. Consulting Table 1, we also observe that PHD-Net's predictions in bin B5 indeed have a lower mean localization error than U-Net's corresponding bin for both datasets, with E-MHA offering the lowest average localization error.
(a) PHD-Net - 4CH images.
(b) U-Net - 4CH images.
(c) PHD-Net - SA images.
(d) U-Net - SA images.
Fig. 5: Cumulative distribution of localization errors, showing the percentage of predictions under a given error threshold, comparing all predictions (All) against the lowest uncertainty subset (B5) for each uncertainty method over all folds and landmarks. The vertical line is the acceptable error threshold, chosen by a radiologist. Higher percentages are better.
6.4 Analysis of Error Bound Estimation
We analyse how accurate the isotonically regressed error bound estimates are for our quantile bins. Figs. 4(e) & 4(f) show the percentage of samples in each bin that fall within the estimated error bounds.
We found we can predict the error bounds for the two extreme bins better than for the intermediate bins. Figs. 4(e) & 4(f) show a similar pattern to the Jaccard Index in Figs. 4(c) & 4(d), with the two extreme bins B1 and B5 predicting error bounds significantly more accurately than the inner bins. Again, this indicates the most useful uncertainty information is present at the extremes of the uncertainty distribution, with the predicted uncertainty-error function unable to capture a consistent relationship for the inner quantiles. This finding is intuitive, as it is easier to put a lower error bound on the most uncertain quantile of predictions, or an upper bound on the most certain predictions, than it is to assign tighter error bounds to middling uncertainty values.
We also found that a well-defined upper bound on heatmap activations is important for error bound estimates. For both the 4CH and SA datasets, S-MHA for PHD-Net is significantly more accurate at predicting error bounds for the highest uncertainty quantile (B1) than for the lowest uncertainty quantile (B5) (56% & 72% compared to 30% & 27% for 4CH & SA, respectively), correlating with S-MHA capturing a greater proportion of those bins (Jaccard Indexes of 32% & 24% compared to 16% & 15%). On the other hand, U-Net using S-MHA predicts error bounds for low uncertainty bins better than for high uncertainty bins. This suggests that although PHD-Net's heatmap activation is a robust predictor of gross mispredictions, the looser upper bound of its heatmap activations makes it hard to make an accurate prediction for the lowest uncertainty quantile (B5). This is alleviated by using an ensemble of networks in E-MHA, where the B5 bound accuracy is improved to 62%.
E-MHA and E-CPV are more consistent than S-MHA. Overall, there is no significant difference between the error bound estimation accuracy of E-MHA and S-MHA, but Figs. 4(e) & 4(f) show E-MHA has less variation in performance between U-Net and PHD-Net compared to S-MHA, suggesting an ensemble of models is more robust. For the 4CH dataset, PHD-Net using E-CPV is on average significantly more accurate at predicting error bounds than S-MHA and E-MHA. However, there are no significant differences for PHD-Net on the easier SA dataset, nor for U-Net on either dataset. There are also no significant differences between U-Net and PHD-Net in error bound estimation accuracy, with each method broadly equally effective for both models.
7 Discussion and Conclusion
7.1 Summary of Findings
This paper presented a general framework to assess any continuous uncertainty measure in landmark localization, demonstrating its use on three uncertainty metrics and two paradigms of landmark localization model. We introduced a new coordinate extraction and uncertainty estimation method, E-MHA, offering the best baseline localization performance and competitive uncertainty estimation.
Our experiments indicate that both heatmap-based uncertainty metrics (S-MHA, E-MHA) and the gold standard coordinate variance uncertainty metric (E-CPV) are applicable to both U-Net and PHD-Net. Despite the two models' distinctly different approaches to generating heatmaps, using the maximum heatmap activation as an indicator of uncertainty is effective for both models. We showed that all investigated uncertainty metrics were effective at filtering out the gross mispredictions (B1) and identifying the 20% most certain predictions (B5), but could not capture useful information for the intermediate uncertainty bins (B2-B4).
Our experiments also showed that E-MHA and S-MHA had a surprisingly similar ability to capture the true error quantiles of the best and worst 20% of predictions (Figs. 4(c) & 4(d)), but E-MHA was more consistent in predicting the error bounds of those bins across models (Figs. 4(e) & 4(f)). This suggests that the calibration of the head and tail ends of the heatmap distributions is stable across our ensemble of models, but susceptible to variance when fitting our isotonically regressed line to predict error bounds. On the more challenging 4CH dataset, E-CPV broadly remained the gold standard for filtering out the worst predictions, but this trend did not continue on the easier SA dataset (Fig. 5).
In terms of error bound estimation, we found a strong correlation with each bin's Jaccard Index against the true quantiles. Bins B1 and B5 could offer good error bound estimates, but the intermediate bins could not (Figs. 4(e) & 4(f)). We found all uncertainty methods performed broadly the same: effective at predicting error bounds for B1 and B5, but poor at predicting error bounds for B2-B4. The one exception was PHD-Net using S-MHA, which could not accurately predict error bounds for B5 due to the high variance in pixel activations of highly certain predictions.
Overall, we found that E-MHA provided the most robust coordinate extraction of all methods, showing the best baseline localization error. When considering only the predictions with the lowest uncertainty (B5), E-MHA achieves the lowest mean localization error across all models and datasets.
7.2 Recommendations
In terms of utility, when resources are available to train an ensemble of models, we recommend using E-MHA as the coordinate extraction and uncertainty estimation method. E-MHA offers the best baseline localization performance, with a sufficient ability to filter out the gross mispredictions (B1) and identify the most certain predictions (B5). It can also sufficiently estimate error bounds for these two bins. However, between these thresholds the uncertainty metric is poorly calibrated, neither binning predictions robustly nor accurately predicting error bounds. If resources are constrained, S-MHA is surprisingly effective at capturing the true error quantiles for bins B1 and B5, but note that when using a patch-based voting heatmap that is not strictly bounded, the error bound estimation for B5 is not robust.
7.3 Conclusion
Beyond the above recommendations, we hope the framework described in this paper can be used to assess refined or novel uncertainty metrics for landmark localization, and act as a baseline for future work. Furthermore, we have shown that both the voting-derived heatmap of PHD-Net and the regressed Gaussian heatmap of U-Net can be exploited for uncertainty estimation. In this paper, we only explored the activation of the peak pixel, but it is likely that more informative measures can be extracted from the broader structure of the heatmap, suggesting that greater potential for uncertainty estimation in landmark localization remains to be uncovered.
References 参考
-
[1]
R. Beichel, H. Bischof, F. Leberl, and M. Sonka, “Robust active
appearance models and their application to IEEE trans. med. imag.”
IEEE Trans. Med. Imag., vol. 24, no. 9, pp. 1151–1169, 2005.
R. Beichel、H. Bischof、F. Leberl 和 M. Sonka,“稳健的主动外观模型及其在 IEEE 传输中的应用”。医学。想象。” IEEE 传输。医学。图像,卷。 24、没有。 9,第 1151–1169 页,2005 年。 -
[2]
H. J. Johnson and G. E. Christensen, “Consistent landmark and intensity-based
image registration,” IEEE Trans. Med. Imag., vol. 21, no. 5, pp.
450–461, 2002.
H. J. Johnson 和 G. E. Christensen,“一致的地标和基于强度的图像配准”,IEEE Trans。医学。图像,卷。 21、没有。 5,第 450–461 页,2002 年。 -
[3]
M. et al, “Semi-automatic construction of reference standards for evaluation
of image registration,” Medical Image Analysis, vol. 15, no. 1, pp.
71–84, 2011.
M. 等人,“图像配准评估参考标准的半自动构建”,医学图像分析,卷。 15、没有。 1,第 71-84 页,2011 年。 -
[4]
D. Gunning, M. Stefik, J. Choi, T. Miller, S. Stumpf, and G.-Z. Yang,
“XAI-Explainable artificial intelligence,”
Science robotics, vol. 4, no. 37, 2019.
D. Gunning、M. Stefik、J. Choi、T. Miller、S. Stumpf 和 G.-Z。杨,“XAI-可解释的人工智能”,科学机器人,卷。 4、没有。 2019 年 37 日。 -
[5]
A. Holzinger, “Interactive machine learning for health informatics: When do
we need the human-in-the-loop?” Brain Informatics, vol. 3, no. 2, pp.
119–131, 2016.
A. Holzinger,“健康信息学的交互式机器学习:我们什么时候需要人机交互?”脑信息学,卷。 3、没有。 2,第 119-131 页,2016 年。 -
[6]
D. Drevickỳ and O. Kodym, “Evaluating deep learning uncertainty measures
in cephalometric landmark localization.” in BIOIMAGING, 2020, pp.
213–220.
D. Drevickỳ 和 O. Kodym,“评估头影测量地标定位中的深度学习不确定性测量。” 《生物成像》,2020 年,第 213-220 页。 -
[7]
O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in Proc. MICCAI. Springer, 2015, pp. 234–241.
O. Ronneberger、P. Fischer 和 T. Brox,“U-Net:用于生物医学图像分割的卷积网络”,Proc 中。米卡伊。施普林格,2015 年,第 234-241 页。 -
[8]
L. Schobs, S. Zhou, M. Cogliano, A. J. Swift, and H. Lu,
“Confidence-quantifying landmark localisation for cardiac MRI,” in
IEEE Int. Symp. Biomed. Imag., 2021, pp. 985–988.
L. Schobs、S. Zhou、M. Cogliano、A. J. Swift 和 H. Lu,“心脏 MRI 的置信度量化地标定位”,IEEE Int。症状。生物医学。图像,2021 年,第 985–988 页。 -
[9]
C. Payer, D. Štern, H. Bischof, and M. Urschler, “Integrating spatial
configuration into heatmap regression based CNNs for landmark
localization,” IEEE Trans. Med. Imag., vol. 54, pp. 207–219, 2019.
C. Payer、D. Štern、H. Bischof 和 M. Urschler,“将空间配置集成到基于热图回归的 CNN 中以实现地标定位”,IEEE Trans。医学。图像,卷。 54,第 207-219 页,2019 年。 -
[10] [10]
Z. Zhong, J. Li, Z. Zhang, Z. Jiao, and X. Gao, “An attention-guided deep
regression model for landmark detection in cephalograms,” in Proc.
MICCAI. Springer, 2019, pp. 540–548.
Z.zhong、J.Li、Z.Zhang、Z.Jiao 和 X.Gao,“一种用于头颅照片中地标检测的注意力引导深度回归模型”,Proc。米卡伊。施普林格,2019 年,第 540–548 页。 -
[11] [11]
N. Torosdagli, D. K. Liberton, P. Verma, M. Sincan, J. S. Lee, and U. Bagci,
“Deep geodesic learning for segmentation and anatomical landmarking,”
IEEE Trans. Med. Imag., vol. 38, no. 4, pp. 919–931, 2018.
N. Torosdagli、D. K. Liberton、P. Verma、M. Sincan、J. S. Lee 和 U. Bagci,“用于分割和解剖标志的深度测地线学习”,IEEE Trans。医学。图像,卷。 38,没有。 4,第 919–931 页,2018 年。 -
[12] [12]
Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, and D. Metaxas, “Quantized densely
connected U-Nets for efficient landmark localization,” in Proc.
ECCV, 2018, pp. 339–354.
Z. Tang、X. Peng、S. Geng、L. Wu、S. Zhang 和 D. Metaxas,“量化密集连接的 U-Nets 以实现高效的地标定位”,Proc 中。 ECCV,2018 年,第 339–354 页。 -
[13]
J. Yang, Q. Liu, and K. Zhang, "Stacked hourglass network for robust facial landmark localisation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 79–87.
[14]
A. Bulat and G. Tzimiropoulos, "Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 109–117.
[15]
J. Zhang, M. Liu, and D. Shen, "Detecting anatomical landmarks from limited medical imaging data using two-stage task-oriented deep neural networks," IEEE Trans. Image Process., vol. 26, no. 10, pp. 4753–4764, 2017.
[16]
O. Emad, I. A. Yassine, and A. S. Fahmy, "Automatic localization of the left ventricle in cardiac MRI images using deep learning," in Proc. EMBC. IEEE, 2015, pp. 683–686.
[17]
Y. Li et al., "Fast multiple landmark localisation using a patch-based iterative network," in Proc. MICCAI. Springer, 2018, pp. 563–571.
[18]
J. M. H. Noothout et al., "CNN-based landmark detection in cardiac CTA scans," in Proc. MIDL, 2018, pp. 1–11.
[19]
S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv preprint arXiv:1706.05098, 2017.
[20]
J. M. H. Noothout, B. D. De Vos, J. M. Wolterink, E. M. Postma, P. A. M. Smeets, R. A. P. Takx, T. Leiner, M. A. Viergever, and I. Išgum, "Deep learning-based regression and classification for automatic landmark localization in medical images," IEEE Trans. Med. Imag., vol. 39, no. 12, pp. 4011–4022, 2020.
[21]
T. Nair, D. Precup, D. L. Arnold, and T. Arbel, "Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation," Med. Image Anal., vol. 59, p. 101557, 2020.
[22]
A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, and T. Kapur, "Confidence calibration and predictive uncertainty estimation for deep medical image segmentation," IEEE Trans. Med. Imag., vol. 39, no. 12, pp. 3868–3878, 2020.
[23]
A. G. Wilson and P. Izmailov, "Bayesian deep learning and a probabilistic perspective of generalization," arXiv preprint arXiv:2002.08791, 2020.
[24]
C. Payer, M. Urschler, H. Bischof, and D. Štern, "Uncertainty estimation in landmark localization based on Gaussian heatmaps," in Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis. Springer, 2020, pp. 42–51.
[25]
J.-H. Lee, H.-J. Yu, M.-j. Kim, J.-W. Kim, and J. Choi, "Automated cephalometric landmark detection with confidence regions using Bayesian convolutional neural networks," BMC Oral Health, vol. 20, no. 1, pp. 1–10, 2020.
[26]
C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in Proc. ICML. PMLR, 2017, pp. 1321–1330.
[27]
M. P. Naeini, G. F. Cooper, and M. Hauskrecht, "Obtaining well calibrated probabilities using Bayesian binning," in Proc. Twenty-Ninth AAAI Conf. Artificial Intelligence, 2015, pp. 2901–2907.
[28]
B. Zadrozny and C. Elkan, "Transforming classifier scores into accurate multiclass probability estimates," in Proc. Eighth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2002, pp. 694–699.
[29]
J. Hurdman et al., "ASPIRE registry: assessing the spectrum of pulmonary hypertension identified at a REferral centre," European Respiratory Journal, vol. 39, no. 4, pp. 945–955, 2012.
[30]
The MONAI Consortium, "Project MONAI," Dec. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.4323059